Introduction

Study Topic & Questions of Interest

For our project we chose to perform an exploratory analysis and visualization of a dataset containing the historical auction prices of thousands of pieces of art. We chose this topic for a variety of reasons: whether you come from the arts and humanities or from economics and finance, art is an interesting and complex field. With new data becoming available, we can for the first time analyze and better understand this historically opaque industry. It is a potential treasure trove for data scientists.

For hundreds of years fine art has been valued by cultures throughout the world for its aesthetic beauty, but fine art as an asset class also appeals to many people today as an attractive alternative investment that is largely uncorrelated with the traditional stock market. The fine art asset class is estimated to have a global market value of over $60 billion in annual turnover and has historically outperformed the S&P 500 in terms of compound annual growth (specifically, 15-year CAGR). Geographically, the art market has for decades been dominated by the United States, the United Kingdom and China, with the greatest concentration of art sales in the entire world occurring right here in New York City.

Our team members come from diverse backgrounds. In addition to experience as data scientists, a few of us have backgrounds in economics and have worked for financial services companies. A few of us have also worked in the art industry for galleries or collectors. This topic sits at the intersection of our experiences and interests and allowed us to explore timely questions about the prices and dynamics of the fine art market.

At a high level, the questions that intrigued us most concerned the auction prices (the hammer price) for pieces of art across different artists, locations and time periods. We initially developed a large list of potential questions and narrowed these down based on viability, given the data we had available as well as the scope and time constraints of the project. Some of the specific questions we decided to look at are:

Team Members & Contributions

Data Description

Overview

This dataset was provided by Arthena (https://arthena.com/). It was sourced from Sotheby's historical auction data by scraping public web pages. The raw dataset includes 22711 rows of data. Each row represents an individual lot from an auction (a lot could be an individual painting, a sculpture, or sometimes even a collection of works). The raw dataset includes 25 columns. Each column represents a feature related to that specific lot, the artist of the lot, or the auction where the lot was sold. Column definitions are listed below by category (Lot, Auction or Artist). For our analysis we also derive new columns from the original raw dataset; the derived column definitions are listed below as well. After dropping a few non-informative columns and adding the derived ones, we ended with a total of 38 columns of data. The detailed logic and thought process is described further in the ‘Data Quality Analysis’ section.

art = read.csv("final_sothebys.csv", header=TRUE, na.strings=c("", NA))
#Save as a data frame
art_df <- as.data.frame(art)

#Dropping features with no interest (non-informative) in the analysis
drop <- c("X", "Unnamed..0", "provenance", "auction_house_id", "external_image_url", "literature", "end_date")
art_df <- art_df[, !(names(art_df) %in% drop)]

#Descriptive statistics
dim(art_df)
## [1] 22690    38
typeof(art_df)
## [1] "list"
str(art_df)
## 'data.frame':    22690 obs. of  38 variables:
##  $ lot_id                   : int  364930 364927 364919 364917 364911 365083 365081 365080 365079 365078 ...
##  $ auction_id               : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ lot_title                : Factor w/ 16710 levels "--------","... À TRAVERS SES LONGS CHEVEUX QUI LUI COUVRAIENT LA FIGURE, J'AI CRU RECONNAÎTRE AMMONARIA, LA TENTATION DE S"| __truncated__,..: 1892 5497 6217 12897 6400 2976 11443 9137 10825 435 ...
##  $ estimate_low             : num  35000 15000 25000 15000 3000 8000 5000 3000 2000 1500 ...
##  $ estimate_high            : num  45000 20000 35000 20000 5000 12000 7000 5000 3000 2500 ...
##  $ hammer_price_bp          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ currency                 : Factor w/ 4 levels "EUR","GBP","HKD",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ nth_in_auction           : int  20 17 10 8 3 173 171 170 169 168 ...
##  $ lot_number               : int  20 17 10 8 3 174 172 171 170 169 ...
##  $ condition                : Factor w/ 18760 levels "- cup on foot: a bump on the rim, some tiny holes on foot, foot with some slight damage, otherwise in good cond"| __truncated__,..: NA 6584 NA NA NA NA NA NA NA NA ...
##  $ auc_title                : Factor w/ 110 levels "'1 in 11' Auction",..: 14 14 14 14 14 14 14 14 14 14 ...
##  $ number_of_lots           : int  173 173 173 173 173 173 173 173 173 173 ...
##  $ location                 : Factor w/ 8 levels "AMSTERDAM","DOHA",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ start_date               : Factor w/ 234 levels "2006-03-31 15:15:23.000",..: 234 234 234 234 234 234 234 234 234 234 ...
##  $ auc_desc                 : Factor w/ 211 levels " RESULTS\n\nSotheby’s Contemporary Curated sale curated by Michael Clark achieved an outstanding total of £2.1 "| __truncated__,..: 188 188 188 188 188 188 188 188 188 188 ...
##  $ sale_id                  : Factor w/ 234 levels "B1701","F1013",..: 231 231 231 231 231 231 231 231 231 231 ...
##  $ artist_id                : int  247 245 121 96 28 363 362 361 360 357 ...
##  $ name                     : Factor w/ 8049 levels "'Antic et Toc', hommage à Nicolas Landau réalisé par Axel Vervoordt avec plusieurs objets ayant appartenus à Ni"| __truncated__,..: 6710 3094 6542 2698 3944 5222 4212 3823 5426 5053 ...
##  $ birth_year               : num  1882 1904 1891 1907 1939 ...
##  $ death_year               : Factor w/ 363 levels " 1610 Caserta)",..: 259 303 277 242 NA NA NA 217 215 337 ...
##  $ auc_date                 : Factor w/ 212 levels "2006-03-31 00:00:00.000",..: 212 212 212 212 212 212 212 212 212 212 ...
##  $ auc_year                 : int  2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
##  $ auc_month                : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ auc_year_month           : Factor w/ 85 levels "2006-03","2006-04",..: 85 85 85 85 85 85 85 85 85 85 ...
##  $ auc_year_month_date      : Factor w/ 212 levels "2006-03-31","2006-04-08",..: 212 212 212 212 212 212 212 212 212 212 ...
##  $ auc_season_num           : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ auc_season               : Factor w/ 4 levels "fall","spring",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ is_untitled              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ auc_order                : int  2 1 1 1 1 10 10 10 10 10 ...
##  $ percent_in_auction       : num  11.6 9.8 5.8 4.6 1.7 100 98.8 98.3 97.7 97.1 ...
##  $ estimate_avg             : num  40000 17500 30000 17500 4000 10000 6000 4000 2500 2000 ...
##  $ auth_era_num             : num  188 190 189 190 193 193 192 187 187 193 ...
##  $ auth_era                 : Factor w/ 52 levels "1140","1430",..: 35 37 36 37 40 40 39 34 34 40 ...
##  $ hammer_price_bp_usd      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ estimate_low_usd         : num  35000 15000 25000 15000 3000 8000 5000 3000 2000 1500 ...
##  $ estimate_high_usd        : num  45000 20000 35000 20000 5000 12000 7000 5000 3000 2500 ...
##  $ estimate_avg_usd         : num  40000 17500 30000 17500 4000 10000 6000 4000 2500 2000 ...
##  $ hammer_price_bp_usd_range: Factor w/ 4 levels "<$10,000","<$50,000",..: 4 4 4 4 4 4 4 4 4 4 ...
#summary(art_df)
#head(art_df, n=5)
#tail(art_df, n=5)

Column Definitions:

# Note that "provenance", "auction_house_id", "external_image_url", "literature" and "end_date" were dropped
names(art_df)
##  [1] "lot_id"                    "auction_id"               
##  [3] "lot_title"                 "estimate_low"             
##  [5] "estimate_high"             "hammer_price_bp"          
##  [7] "currency"                  "nth_in_auction"           
##  [9] "lot_number"                "condition"                
## [11] "auc_title"                 "number_of_lots"           
## [13] "location"                  "start_date"               
## [15] "auc_desc"                  "sale_id"                  
## [17] "artist_id"                 "name"                     
## [19] "birth_year"                "death_year"               
## [21] "auc_date"                  "auc_year"                 
## [23] "auc_month"                 "auc_year_month"           
## [25] "auc_year_month_date"       "auc_season_num"           
## [27] "auc_season"                "is_untitled"              
## [29] "auc_order"                 "percent_in_auction"       
## [31] "estimate_avg"              "auth_era_num"             
## [33] "auth_era"                  "hammer_price_bp_usd"      
## [35] "estimate_low_usd"          "estimate_high_usd"        
## [37] "estimate_avg_usd"          "hammer_price_bp_usd_range"

Lot
lot_id: a unique id for each lot.
lot_title: the title of the lot. A lot can sometimes consist of multiple pieces of art; we assume one lot corresponds to one piece, since that is the most common case.
estimate_low: the low-end auction price estimate for a lot, given by Sotheby's.
estimate_high: the high-end auction price estimate for a lot, given by Sotheby's.
hammer_price_bp: the price the lot sold for at auction, plus buyer's premium (a percentage fee charged by Sotheby's and paid by the buyer).
currency: currency denomination for the price estimates and hammer price (limited to USD, EUR, GBP, HKD).
nth_in_auction: the order that the lot was presented in at auction.
lot_number: a number assigned to a lot for the given auction, different than nth_in_auction.
condition: description of the condition of the lot (messy text field - not used for our analysis).
provenance: description of who owned the lot previously (messy text field - not used for our analysis).
literature: different publications that the lot was mentioned in (messy text field - not used for our analysis).
external_image_url: link to the image (not used for our analysis).

Auction
auction_house_id: unique id for each auction house (for this dataset, all 1 since we are only using Sothebys data).
auction_id: unique id for each auction.
auc_title: title of the auction.
number_of_lots: total number of lots in the auction.
location: location where the auction was held.
start_date: start date of the auction.
end_date: end date of the auction (same as start date for this dataset).
auc_desc: description of the auction (messy text field - not used for our analysis).
sale_id: unique auction sale id assigned by Sothebys.

Artist
artist_id: unique id for each artist.
name: name of the artist.
birth_year: artist’s approximate birth year (messy text field - not used for our analysis).
death_year: artist’s approximate death year (messy text field - not used for our analysis).

Derived
estimate_avg: the average between estimate_low and estimate_high.
is_untitled: an indicator variable for whether the title of the lot is “untitled” (in any of several languages).
auc_year: the year of the auction (YYYY format).
auc_month: the month of the auction (as integers 1-12).
auc_season: the season of the auction (fall, spring, summer, winter; auc_season_num holds the corresponding integers 1-4).
auc_date: the date of the auction.
auc_order: the order in the auction by quantiles (as integers 1-10).
percent_in_auction: the percentage through an auction that a lot was shown (nth_in_auction divided by number_of_lots).
hammer_price_bp_usd: hammer_price_bp converted to USD.
estimate_high_usd: estimate_high converted to USD.
estimate_low_usd: estimate_low converted to USD.
estimate_avg_usd: estimate_avg converted to USD.
hammer_price_bp_usd_range: factor bucketing the hammer price (in USD) into labeled ranges.
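
The derived columns were created during the upstream Python preprocessing, so only their definitions are given here. As a rough, illustrative R sketch of how a few of them could be reproduced (the list of “untitled” spellings and the price-range breaks below are our assumptions, not the exact values used upstream):

# Illustrative sketch only: the actual derivation happened in Python upstream.
library(dplyr)
library(stringr)

# assumed spellings of "untitled" in the languages that were checked
untitled_words <- c("untitled", "senza titolo", "zonder titel",
                    "sans titre", "sin título", "ohne titel")

art_derived <- art_df %>%
  mutate(
    estimate_avg       = (estimate_low + estimate_high) / 2,
    percent_in_auction = 100 * nth_in_auction / number_of_lots,
    is_untitled        = as.integer(
      !is.na(lot_title) &
        str_detect(str_to_lower(as.character(lot_title)),
                   paste(untitled_words, collapse = "|"))
    ),
    # breaks below are assumed; they mirror the factor levels seen in str() above
    hammer_price_bp_usd_range = cut(hammer_price_bp_usd,
                                    breaks = c(-Inf, 1e4, 5e4, 5e5, Inf),
                                    labels = c("<$10,000", "<$50,000",
                                               "<$500,000", "$500,000+"))
  )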

# data types
sapply(art_df, class)
##                    lot_id                auction_id 
##                 "integer"                 "integer" 
##                 lot_title              estimate_low 
##                  "factor"                 "numeric" 
##             estimate_high           hammer_price_bp 
##                 "numeric"                 "numeric" 
##                  currency            nth_in_auction 
##                  "factor"                 "integer" 
##                lot_number                 condition 
##                 "integer"                  "factor" 
##                 auc_title            number_of_lots 
##                  "factor"                 "integer" 
##                  location                start_date 
##                  "factor"                  "factor" 
##                  auc_desc                   sale_id 
##                  "factor"                  "factor" 
##                 artist_id                      name 
##                 "integer"                  "factor" 
##                birth_year                death_year 
##                 "numeric"                  "factor" 
##                  auc_date                  auc_year 
##                  "factor"                 "integer" 
##                 auc_month            auc_year_month 
##                 "integer"                  "factor" 
##       auc_year_month_date            auc_season_num 
##                  "factor"                 "integer" 
##                auc_season               is_untitled 
##                  "factor"                 "integer" 
##                 auc_order        percent_in_auction 
##                 "integer"                 "numeric" 
##              estimate_avg              auth_era_num 
##                 "numeric"                 "numeric" 
##                  auth_era       hammer_price_bp_usd 
##                  "factor"                 "numeric" 
##          estimate_low_usd         estimate_high_usd 
##                 "numeric"                 "numeric" 
##          estimate_avg_usd hammer_price_bp_usd_range 
##                 "numeric"                  "factor"

Data Quality Analysis

Preprocessing

In this part of the project we explore the quality of the auction data. Before analyzing the overall quality, we conducted some simple preprocessing in Python, which included the following steps:

  1. Added columns for the normalized (full) date, year, month, and season of each auction.
  2. ‘Start Date’ and ‘End Date’ are always the same within the dataset, so we decided to use only ‘Start Date’ for the time when the auction occurred.
  3. The ‘Title’ column contains titles in a variety of languages. We added a new column indicating whether the piece is “untitled”, checking the most common languages used: Italian, Dutch, English, French, Spanish and German.
  4. Created a new column that breaks the order of lots within each auction into deciles, grouped by auction id.
  5. Added an ‘Average Estimate’ column to reflect the average of the high and low price estimates.
  6. Converted all currencies to USD so that pieces sold in different locations can be compared. We matched the exchange rate at the time of the sale to convert each transaction. (An illustrative R sketch of steps 4 and 6 appears below.)

The complete logic and code can be found here: INSERT LINK HERE
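
As a rough illustration of steps 4 and 6 above, the decile bucketing and the currency conversion could be reproduced in R along the following lines; the fx_rates lookup table is hypothetical and stands in for the daily exchange rates that were matched to each sale date.

library(dplyr)

# step 4 (sketch): bucket each lot's position within its auction into deciles 1-10
art_step4 <- art_df %>%
  group_by(auction_id) %>%
  mutate(auc_order = ntile(nth_in_auction, 10)) %>%
  ungroup()

# step 6 (sketch): join a hypothetical fx_rates table (columns: currency,
# auc_date, usd_per_unit) and convert the price columns to USD
# art_step6 <- art_step4 %>%
#   left_join(fx_rates, by = c("currency", "auc_date")) %>%
#   mutate(hammer_price_bp_usd = hammer_price_bp * usd_per_unit,
#          estimate_low_usd    = estimate_low  * usd_per_unit,
#          estimate_high_usd   = estimate_high * usd_per_unit)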

Data Quality Exploration

Let us start by looking at the missing values and whether there are any general patterns. Since we are dealing with a large number of rows, we reduce each repeated missing-value pattern to a single row using the visna function.

#install.packages("extracat")
library(extracat)
visna(art_df, sort= "b")

Looking at the output above, we can conclude that the overall state of the dataset is relatively good. The second most common pattern has no missing values, while the two most “problematic” features appear to be death_year and birth_year.

One important observation we will need to be careful about in further analysis is the relatively large number of missing values in the hammer_price_bp feature. We explored the qualitative reasons behind the missing values in that category with the provider of the data and found that they are largely driven by a couple of locations (such as Doha) that might have less strict regulations around data governance. This observation is outlined in the graph below.
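
Before plotting, a quick tabulation of missing hammer prices by location can back this up; a minimal sketch:

# Sketch: share of lots with a missing hammer_price_bp in each sale location
library(dplyr)
art_df %>%
  group_by(location) %>%
  summarise(n_lots          = n(),
            n_missing_price = sum(is.na(hammer_price_bp)),
            pct_missing     = round(n_missing_price / n_lots, 3)) %>%
  arrange(desc(pct_missing))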

#install.packages("viridis")
library(viridis)
art_location <- art_df %>% gather(attribute, value, -location) %>% mutate(missing = ifelse(is.na(value), "yes", "no"))
ggplot(art_location, aes(x = location, y = attribute, fill = missing)) +
  geom_tile(color = "white") +
  ggtitle("Missing Values by Location") +
  xlab("Location") + ylab("Feature Name") +
  scale_fill_viridis(discrete=TRUE) +
  theme_bw()

This map shows that some features, such as hammer_price_bp, condition, and auc_desc, are only missing in certain locations (typically a pair of locations at a time). Therefore, we can speculate that this data is MAR (missing at random), conditional on the location feature.

Indeed, the table below shows that the overall percentage of missing values varies quite significantly across locations. Amsterdam has the highest share of missing values (about 8% of its cells), followed by Hong Kong (about 7%).

art_location <- art_df %>% gather(attribute, value, -location)
percent_missing <- art_location %>% group_by(location) %>%
  summarise(num_na = sum(is.na(value)), total = n()) %>%
  mutate(percent_na = num_na/total)%>%
  arrange(-percent_na)
percent_missing
## # A tibble: 8 x 4
##   location  num_na  total percent_na
##   <fct>      <int>  <int>      <dbl>
## 1 AMSTERDAM   6638  81992     0.0810
## 2 HONG KONG   1189  16872     0.0705
## 3 PARIS       6156  94054     0.0655
## 4 MILAN       2866  51097     0.0561
## 5 DUBAI         90   1998     0.0450
## 6 NEW YORK   15565 357531     0.0435
## 7 LONDON      9708 233507     0.0416
## 8 DOHA          98   2479     0.0395

This leads to a logical follow-up question: are there any other variables that could explain the missing data? To check, we also looked at similar graphs grouped by year and month.

#install.packages("viridis")
#library(viridis)
art_year <- art_df %>% gather(attribute, value, -auc_year) %>% mutate(missing = ifelse(is.na(value), "yes", "no"))
year <- ggplot(art_year, aes(x = factor(auc_year), y = attribute, fill = missing)) +
  geom_tile(color = "white") + 
  ggtitle("Missing Values by Year") +
  xlab("year") + ylab("Feature Name") +
  scale_fill_viridis(discrete=TRUE) +
  theme_bw()
art_month <- art_df %>% gather(attribute, value, -auc_month) %>% mutate(missing = ifelse(is.na(value), "yes", "no"))
month <-ggplot(art_month, aes(x = factor(auc_month), y = attribute, fill =       
                                missing)) +
  geom_tile(color = "white") + 
  ggtitle("Missing Values by Month") +
  xlab("Month") + ylab("Feature Name") +
  scale_fill_viridis(discrete=TRUE) +
  theme_bw()
grid.arrange(year, month)

While the month view looks more or less evenly distributed, there is an interesting pattern in the year view. It comes as no surprise that the older years (2006-2009) carry all of the missing values for the description and condition features. There also appear to be three-year periods with essentially complete data: 2010-2012 and 2015-2017.

Next, let us look at the exact percentages and values for the overall missing data.

#install.packages("skimr")
library(skimr)
skimr::skim(art_df) %>% filter(stat =="missing") %>% arrange(desc(value)) %>% select(variable, value) %>% mutate(percent = value/nrow(art_df)) %>% filter (percent>0)
## # A tibble: 8 x 3
##   variable            value percent
##   <chr>               <dbl>   <dbl>
## 1 death_year          14806   0.653
## 2 birth_year           5676   0.250
## 3 auth_era_num         5676   0.250
## 4 hammer_price_bp      3898   0.172
## 5 hammer_price_bp_usd  3898   0.172
## 6 auc_desc             3079   0.136
## 7 condition            2808   0.124
## 8 lot_title            2469   0.109

Looking at the table, the estimate columns (along with their corresponding values generated during preprocessing) are missing very little data (well under 1%), so those few rows can easily be filtered out in later analysis.

Nonetheless, hammer_price_bp is missing for roughly 17% of lots. Beyond what we saw when looking by location, our guess is that some auctions are missing hammer_price_bp in their entirety, which would explain the difference. This can easily be checked with an aggregate table.

art_price <- art_df[, c("auction_id", "hammer_price_bp")]
percent_missing <- art_price %>% group_by(auction_id) %>%
  summarise(num_na = sum(is.na(hammer_price_bp)), total = n()) %>%
  mutate(percent_na = num_na/total)%>%
  arrange(-percent_na)
percent_missing %>% filter(percent_na>=0.4)
## # A tibble: 10 x 4
##    auction_id num_na total percent_na
##         <int>  <int> <int>      <dbl>
##  1          2    258   258      1.00 
##  2          3    173   173      1.00 
##  3        108     48    90      0.533
##  4        180      8    16      0.500
##  5        100    116   240      0.483
##  6        186      7    15      0.467
##  7        133     47   103      0.456
##  8         30     36    81      0.444
##  9         20    136   335      0.406
## 10         19     14    35      0.400

Looking at the table, a significant number of auctions are missing a large share of their price data (each of the ten shown above is missing at least 40%, and two are missing it entirely). Therefore, it appears that not only the location but also the auction itself is a determining factor for missing hammer price values.

Lastly, we would like to check for a correlation between an auction's average hammer_price_bp and its percentage of missing values (since hammer price is the feature that matters most to our analysis and is the most prone to such a relationship). Could it be that very high- or very low-priced lots simply go unreported and are therefore missing more often?

# average hammer price per auction, using only lots with a recorded price
art_auction <- art_df[, c("auction_id", "hammer_price_bp")] %>% arrange(auction_id)
art_average <- art_auction %>% filter(!is.na(hammer_price_bp)) %>% 
  group_by(auction_id) %>% summarise(mean = mean(hammer_price_bp)) %>% arrange(auction_id)
percent_missing <- percent_missing %>% arrange(auction_id)
# auctions 2 and 3 have no recorded hammer prices at all, so add them back with
# mean = 0 to keep art_average aligned with percent_missing
add_2 <- data.frame(auction_id=2, mean=0)
add_3 <- data.frame(auction_id=3, mean=0)
art_average <- rbind(art_average, add_2)
art_average <- rbind(art_average, add_3) %>% arrange(auction_id)
# combine average price and percentage missing per auction and plot
percent <- percent_missing[4]
average <- art_average[2]
auction_id <- percent_missing[1]
corr <- data.frame(auction_id=auction_id, percent=percent, average = average)
ggplot(corr, aes(average, percent)) + geom_point(col="blue") +ggtitle("Auction Average Hammer Price vs Percentage of Missing Values") 

There does not seem to be a very strong correlation between the two variables, just a subtle suggestion that auctions with a smaller average price tend to have more NAs. Therefore, we will treat location as the main determinant of missing hammer_price_bp values.
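
To put a rough number on this impression, one could compute a rank correlation between the two quantities directly from art_df; a minimal sketch:

# Sketch: rank correlation between an auction's average hammer price and its
# share of missing hammer prices
library(dplyr)
check <- art_df %>%
  group_by(auction_id) %>%
  summarise(avg_price   = mean(hammer_price_bp, na.rm = TRUE),
            pct_missing = mean(is.na(hammer_price_bp))) %>%
  filter(!is.nan(avg_price))   # drop auctions with no recorded prices at all
cor(check$avg_price, check$pct_missing, method = "spearman")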

Before we move on, let us quickly follow up on the action items from the data quality analysis.

  1. filter out NAs for the price estimates
  2. take note to exclude Doha and Hong Kong from the hammer_price_bp analysis
art_final <- art_df %>% filter(!is.na(estimate_low))

Main Analysis (Exploratory Data Analysis)

# drop rows with NA hammer price (the lot did not sell or the price was not reported)
# and keep only the listed locations, which drops Doha as indicated in the data quality analysis
df1 <- art_df %>% drop_na(hammer_price_bp_usd)
df2 <- df1 %>%
  # remove extreme outliers: prices more than 3 standard deviations from the median
  filter(abs(hammer_price_bp_usd - median(hammer_price_bp_usd)) <= 3 * sd(hammer_price_bp_usd)) %>%
  filter(location %in% c("HONG KONG", "NEW YORK", "LONDON", "PARIS", "MILAN", "DUBAI", "AMSTERDAM"))

Auction Information

Insert elizabet’s part

Do Certain Lot Attributes Result in Higher Price?

Lot Titles

Which lots have higher prices?

# take the 500 highest-priced lots and weight their title words by hammer price
df_wordcloud <- df1[,c("lot_title","hammer_price_bp_usd")]
df_wordcloud <- arrange(df_wordcloud,desc(df_wordcloud$hammer_price_bp_usd))[1:500,]
library(wordcloud)
library(tm)
pal <- brewer.pal(9, "OrRd")
pal <- pal[-(1:3)]   # drop the lightest colours
wordcloud(df_wordcloud$lot_title, df_wordcloud$hammer_price_bp_usd, min.freq=500, scale=c(5, .5), random.order = FALSE, random.color = FALSE, colors= pal)

What words appear most often in the lot titles?

# collapse the lot_title column by word and count the frequency they appear in the titles.
temp <- paste(df1$lot_title, collapse=' ' )
temp <- tolower(temp)
temp <- gsub(" *\\b[[:alpha:]]{1}\\b *", " ", temp)
temp <- gsub('[[:punct:] ]+',' ',temp)
temp <- as.list(strsplit(temp, " "))
temp <- unlist(temp)[!(unlist(temp) %in% stopwords("english"))]
temp <- unlist(temp)[!(unlist(temp) %in% "na")]
word_count <- na.omit(as.data.frame(table(temp)))
word_count <- arrange(word_count,desc(word_count$Freq))[1:300,]

# visualize word frequencies
pal <- brewer.pal(9, "Dark2")
wordcloud(word_count$temp, word_count$Freq, min.freq =20, scale=c(5, .5), random.order = FALSE, random.color = FALSE, colors= pal)

Looking at lots named “Untitled”, what price ranges do they fall into? Is being untitled correlated with price?

library(vcd)
# reorder the price-range levels so they run from the highest to the lowest bucket
df1 <- df1 %>% 
  dplyr::mutate(hammer_price_bp_usd_range = forcats::fct_relevel(hammer_price_bp_usd_range, "<$50,000"))%>%
  dplyr::mutate(hammer_price_bp_usd_range = forcats::fct_relevel(hammer_price_bp_usd_range, "<$500,000"))%>%
   dplyr::mutate(hammer_price_bp_usd_range = forcats::fct_relevel(hammer_price_bp_usd_range, "$500,000+"))

vcd::mosaic(hammer_price_bp_usd_range~is_untitled, direction = c("v", "h"),df1,
       gp = gpar(fill = c("lightyellow", "lightpink")),
                 labeling = labeling_border(rot_labels = c(0, 45),pos_labels="center"))

Are they mostly contemporary?

#untitle_labels = read.csv("untitle_ratio.csv", header=TRUE)

Does the era of the lot affect its price?

df3 <- df2 %>% 
  filter(df2$birth_year>1800)
ggplot(df3, aes(birth_year,hammer_price_bp_usd))  + 
  geom_smooth(method='lm',formula=y ~ poly(x, 2))+
  geom_point(alpha = .1)  + 
  theme_grey(10)+scale_y_log10()+  geom_density_2d(bins = 5)

#box plot price vs year
#ggplot(df3, aes(x=auth_era), y=hammer_price_bp_usd)+ stat_summary(fun.y="mean", geom="line", aes(group=1))
ggplot(df3, aes(auth_era, hammer_price_bp_usd)) +
  geom_boxplot()+theme(axis.text.x = element_text(angle = 90, hjust = 1))+scale_y_log10()

Do Certain External Factors Result in Higher Prices?

Does the Order Matter?

ggplot(df2, aes(percent_in_auction,hammer_price_bp_usd))  + 
  geom_smooth(method='lm',formula=y~x,color="red")+
  geom_point(alpha = .05)  + 
  theme_grey(10)+scale_y_log10()+
  facet_wrap(~location)

Is there an impact from the financial crisis?

df1$auc_ymd <- as.Date(df1$auc_year_month_date)
art_yearfin <- df1 %>% group_by(month=lubridate::floor_date(auc_ymd, "month")) %>% summarise(revenue = mean(hammer_price_bp_usd))
ggplot() +geom_line(data=art_yearfin, aes(x=month, y=revenue/1000000))+ggtitle("Financial Crisis' Effect on average lot price ($M)")+ylab("average price")+xlab("Time")+theme(axis.text.x = element_text(angle = 45, hjust = 1))+scale_y_log10()

Let us also look at Sotheby's total auction revenue on a yearly scale. Our guess is that we should see a significant drop around the time of the financial crisis.

art_finance <- art_final[c("auc_year", "auc_month", "location","hammer_price_bp_usd")] %>% filter(!is.na(hammer_price_bp_usd))
art_yearfin <- art_finance %>% group_by(auc_year) %>% summarise(revenue = sum(hammer_price_bp_usd))
ggplot(art_yearfin, aes(x=auc_year, y=as.numeric(revenue/1000000))) +geom_line(col="blue")+ggtitle("Financial Crisis' Effect on Auction Revenue ($M)")+ylab("Revenue")+xlab("Year")

Indeed, we see a big dip around 2010. It is surprising that it took some time for the effect to reach the auction houses. Perhaps certain locations are skewing the data. Let us facet the data by location and see whether that gives a clearer picture.

art_locfin <- art_finance %>% group_by(auc_year, location) %>% summarise(revenue = sum(hammer_price_bp_usd))
ggplot(art_locfin, aes(x=auc_year, y=as.numeric(revenue/1000000))) +geom_line(col="blue")+ggtitle("Financial Crisis' Effect on Auction Revenue ($M)")+ylab("Revenue")+xlab("Year") + facet_wrap(~location, scales ="free_y")

It looks like some locations indeed do not have consistent data and might be throwing the calculation off (Dubai, Doha, and Amsterdam). Let us take a look at the same chart without those locations.

art_locfin <- art_finance %>% group_by(auc_year, location) %>% summarise(revenue = sum(hammer_price_bp_usd)) %>% filter(location %in%  c("HONG KONG", "NEW YORK", "LONDON", "PARIS", "MILAN"))
ggplot(art_locfin, aes(x=auc_year, y=as.numeric(revenue/1000000))) +geom_line(col="blue")+ggtitle("Financial Crisis' Effect on Auction Revenue ($M)")+ylab("Revenue")+xlab("Year") + facet_wrap(~location, scales ="free_y")

We notice a consistent drop in revenue across all of the locations above starting in 2010. Our hypothesis therefore appears to hold: the financial crisis did have an effect on auction revenue across the world (particularly significant drops are observed in New York and Hong Kong).

Does Season Matter?

vcd::mosaic(hammer_price_bp_usd_range~auc_season, direction = c("v", "h"),df1,
       gp = gpar(fill = c("lightyellow", "lightpink")),
                 labeling = labeling_border(rot_labels = c(0, 90),pos_labels="center"))

Does Location Matter?

Executive Summary

Interactive Component